Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Voice activity detection (VAD), which classifies frames as speech or
non-speech, is an important module in many speech applications including
speaker verification. In this paper, we propose a novel method, called
self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD
into a deep speaker embedding system. The proposed method is a combination of
the following two approaches. The first approach is soft VAD, which performs a
soft selection of frame-level features extracted from a speaker feature
extractor. The frame-level features are weighted by their corresponding speech
posteriors estimated from the DNN-based VAD, and then aggregated to generate a
speaker embedding. The second approach is self-adaptive VAD, which fine-tunes
the pre-trained VAD on the speaker verification data to reduce the domain
mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes,
namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA).
Experiments on a Korean speech database demonstrate that the verification
performance is improved significantly in real-world environments by using
self-adaptive soft VAD.
Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU 2019
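The soft VAD step described in this abstract, weighting frame-level features by their speech posteriors before aggregation, can be illustrated with a minimal NumPy sketch. The function name `soft_vad_pool`, the `eps` constant, and the choice of a normalized weighted mean as the aggregation are assumptions made for illustration; the abstract does not specify the exact pooling used.

```python
import numpy as np


def soft_vad_pool(frame_features, speech_posteriors, eps=1e-8):
    """Posterior-weighted mean of frame-level features (hypothetical helper).

    frame_features:    array of shape (T, D), one feature vector per frame.
    speech_posteriors: array of shape (T,), speech probability per frame in [0, 1].
    Returns a single vector of shape (D,).
    """
    feats = np.asarray(frame_features, dtype=float)
    post = np.asarray(speech_posteriors, dtype=float)
    if feats.ndim != 2 or post.shape != (feats.shape[0],):
        raise ValueError("expected features (T, D) and posteriors (T,)")
    # Assumption: normalize the posteriors so they act as averaging weights.
    weights = post / (post.sum() + eps)
    return weights @ feats


# Frames judged non-speech (posterior 0) contribute nothing to the embedding.
feats = np.array([[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]])
post = np.array([1.0, 1.0, 0.0])
embedding = soft_vad_pool(feats, post)
```

Unlike a hard VAD that discards frames below a threshold, this weighting keeps every frame and lets uncertain frames contribute in proportion to their posterior.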
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness when dealing with utterances of arbitrary
duration, this paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves on previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 202
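The top-down pathway with lateral connections mentioned in this abstract can be sketched roughly as follows. This is an illustrative simplification, not the paper's module: the function names, the use of plain matrices in place of 1x1 lateral convolutions, nearest-neighbour upsampling along time, and mean pooling followed by concatenation are all assumptions.

```python
import numpy as np


def upsample_time(x, target_len):
    """Nearest-neighbour resampling of a (T, C) map to target_len frames."""
    idx = np.arange(target_len) * x.shape[0] // target_len
    return x[idx]


def feature_pyramid_embedding(features, lateral_weights):
    """Hypothetical feature-pyramid aggregation over multi-scale features.

    features:        list of (T_i, C_i) maps, ordered shallow (long) to deep (short).
    lateral_weights: list of (C_i, C) matrices standing in for 1x1 lateral convs.
    Returns a vector of length len(features) * C.
    """
    # Lateral connections: project every scale to a common channel size C.
    laterals = [f @ w for f, w in zip(features, lateral_weights)]
    # Top-down pathway: start at the deepest map and move toward the shallowest,
    # adding the upsampled coarser map to each lateral projection.
    enhanced = [laterals[-1]]
    for lat in reversed(laterals[:-1]):
        top_down = upsample_time(enhanced[0], lat.shape[0])
        enhanced.insert(0, lat + top_down)
    # Assumption: mean-pool each enhanced map over time, then concatenate.
    return np.concatenate([e.mean(axis=0) for e in enhanced])


# Three scales with 8, 4, and 2 frames and 4, 6, and 8 channels.
shapes = [(8, 4), (4, 6), (2, 8)]
features = [np.ones(s) for s in shapes]
weights = [np.ones((c, 3)) / c for _, c in shapes]
embedding = feature_pyramid_embedding(features, weights)
```

Because each enhanced map is pooled over time, the embedding length depends only on the number of scales and the common channel size, not on the utterance duration.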
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification
In this paper, we propose a new pooling method called spatial pyramid
encoding (SPE) to generate speaker embeddings for text-independent speaker
verification. We first partition the output feature maps from a deep residual
network (ResNet) into increasingly fine sub-regions and extract speaker
embeddings from each sub-region through a learnable dictionary encoding layer.
These embeddings are concatenated to obtain the final speaker representation.
The SPE layer not only generates a fixed-dimensional speaker embedding for a
variable-length speech segment, but also aggregates the information of feature
distribution from multi-level temporal bins. Furthermore, we apply deep length
normalization by augmenting the loss function with ring loss. By applying ring
loss, the network gradually learns to normalize the speaker embeddings using
model weights themselves while preserving convexity, leading to more robust
speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed
system using the SPE layer and ring loss-based deep length normalization
outperforms both i-vector and d-vector baselines.
Comment: 5 pages, 2 figures, Interspeech 201
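The two ideas in this abstract, pooling over increasingly fine temporal bins and penalizing embedding norms with ring loss, can be sketched in NumPy as below. Several simplifications are assumed: plain mean pooling is swapped in for the learnable dictionary encoding layer, the bin levels `(1, 2, 4)` are arbitrary, and the target radius is passed as an argument rather than learned. The function names are hypothetical.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool a (T, D) feature map over increasingly fine temporal bins.

    Mean pooling is used here in place of a learnable dictionary encoder.
    Returns a vector of length sum(levels) * D, independent of T.
    """
    fmap = np.asarray(feature_map, dtype=float)
    if fmap.ndim != 2 or fmap.shape[0] < max(levels):
        raise ValueError("need a (T, D) map with T >= max(levels)")
    pooled = []
    for n_bins in levels:
        # Split the time axis into n_bins nearly equal sub-regions.
        for chunk in np.array_split(fmap, n_bins, axis=0):
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)


def ring_loss(embeddings, radius, weight):
    """Ring loss: (weight / 2) * mean((||x||_2 - radius) ** 2) over a batch."""
    norms = np.linalg.norm(np.asarray(embeddings, dtype=float), axis=1)
    return 0.5 * weight * np.mean((norms - radius) ** 2)


# Segments of different lengths map to embeddings of the same size.
short = spatial_pyramid_pool(np.random.rand(20, 16))
long = spatial_pyramid_pool(np.random.rand(300, 16))
penalty = ring_loss(np.array([[3.0, 4.0], [0.0, 1.0]]), radius=1.0, weight=0.01)
```

In training, the ring loss term would be added to the main classification loss so that embedding norms are pulled toward the target radius.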